arXiv:1509.06882v1 [cs.SD] 23 Sep 2015


ROBUST COHERENCE-BASED SPECTRAL ENHANCEMENT FOR DISTANT SPEECH RECOGNITION


Hendrik Barfuss, Christian Huemmer, Andreas Schwarz, and Walter Kellermann 


Multimedia Communications and Signal Processing, 
Friedrich-Alexander University Erlangen-Nürnberg
Cauerstr. 7, 91058 Erlangen, Germany 
{barfuss,huemmer,schwarz,wk}@lnt.de 


ABSTRACT 

In this contribution to the 3rd CHiME Speech Separation and
Recognition Challenge (CHiME-3), we extend the acoustic
front-end of the CHiME-3 baseline speech recognition system
by a coherence-based Wiener filter which is applied to
the output signal of the baseline beamformer. To compute the
time- and frequency-dependent postfilter gains, the ratio between
direct and diffuse signal components at the output of
the baseline beamformer is estimated and used as an approximation
of the short-time signal-to-noise ratio. The proposed
spectral enhancement technique is evaluated with respect to
word error rates of the CHiME-3 challenge baseline speech
recognition system using real speech recorded in public environments.
Results confirm the effectiveness of the coherence-based
postfilter when integrated into the front-end signal enhancement.

Index Terms — Robust automatic speech recognition, 
Postfiltering, Spectral enhancement, Coherence-to-diffuse 
power ratio, Wiener filter 

1. INTRODUCTION 

For a satisfying user experience of human-machine interfaces,
it is crucial to ensure a high accuracy in automatically recognizing
the user’s speech. As soon as no close-talking microphone
is used, the recognition accuracy suffers from reverberation
as well as background noise and active interfering
speakers picked up by the microphones in addition to the desired
speech signal [1, 2]. Signal processing techniques for
robust speech recognition in noisy environments can be divided
into two major categories, namely front-end (e.g.,
speech enhancement [3, 4, 5]) and back-end (e.g., acoustic-model
adaptation [6, 7, 8]) processing techniques.

The 3rd CHiME Speech Separation and Recognition
Challenge (CHiME-3) [9] targets the performance of
state-of-the-art Automatic Speech Recognition (ASR) systems in
real-world scenarios. In this year’s challenge, the primary
goal is to improve the ASR performance for real recorded
speech of a person talking to a tablet device in realistic noisy
environments by employing front-end and/or back-end signal
processing techniques.

In this contribution to the CHiME-3 challenge, we focus
on front-end speech enhancement and extend the CHiME-3
baseline front-end signal processing, consisting of a Minimum
Variance Distortionless Response (MVDR) beamformer,
by a coherence-based postfilter. The postfilter is realized
as a Wiener filter, where an estimate of the ratio between
direct and diffuse signal components at the output of the
baseline MVDR beamformer is used as an approximation
of the short-time Signal-to-Noise Ratio (SNR) to compute
the time- and frequency-dependent postfilter gains. The employed
postfilter is Direction-of-Arrival (DoA)-independent
and has a low computational complexity.

An overview of the overall signal processing pipeline is
given in Fig. 1. Whereas the purpose of the beamformer is to
reduce the signal components from interfering point sources
by spatial filtering, the postfilter shall remove diffuse interference
components, e.g., reverberation, from the beamformer
output signal. The output of the front-end signal enhancement
(consisting of MVDR beamformer and postfilter) is further
processed by feature extraction/transformation and acoustic
modeling following the CHiME-3 baseline ASR system,
which provides a Hidden Markov Model (HMM)-Gaussian
Mixture Model (GMM)-based as well as an HMM-Deep
Neural Network (DNN)-based speech recognizer [9].

The remainder of this article is structured as follows: In
Section 2, the proposed front-end signal enhancement is introduced
in detail, followed by a brief review of the employed
ASR system in Section 3. The performance of the front-end
speech enhancement is evaluated with respect to word error
rates (WERs) of the baseline ASR system, which are presented
in Section 4. A conclusion and an outlook to future
work are given in Section 5.




Fig. 1. Overview of the overall signal processing pipeline with beamformer and postfilter as acoustic front-end signal
processing. The acoustic back-end system, including feature extraction/transformation, is identical to the baseline acoustic
back-end system provided by CHiME-3 [9].


2. FRONT-END ENHANCEMENT TECHNIQUES 

The front-end speech enhancement considered in this article
consists of an MVDR beamformer (provided by the CHiME-3
baseline) and a single-channel coherence-based postfilter. In
the following, the baseline MVDR beamformer is briefly reviewed,
followed by a detailed presentation of the proposed
postfilter.

2.1. Signal model 

For a consistent presentation of the front-end speech enhancement
considered in this work, we first introduce a signal
model which will be used throughout this article.

The $N$ microphone signals of the microphone array in the
short-time Fourier transform (STFT) domain at frame $l$ and
frequency $f$ are given as:

$$\mathbf{x}(l,f) = \mathbf{h}(l,f)\,S(l,f) + \mathbf{n}(l,f), \tag{1}$$

where vector

$$\mathbf{x}(l,f) = [X_0(l,f),\, X_1(l,f),\, \ldots,\, X_{N-1}(l,f)]^{T} \tag{2}$$

contains the microphone signals, $S(l,f)$ denotes the clean
source signal, and $\mathbf{n}(l,f)$ includes sensor noise as well as
diffuse background noise components and is defined analogously
to $\mathbf{x}(l,f)$ in (2). Assuming free-field propagation of
sound waves, $\mathbf{h}(l,f)$ represents the steering vector modeling
the sound propagation between the desired source located at
direction $(\phi_d, \theta_d)$ and all $N$ microphones:

h{lj) = ..., (3) 

where the wavevector $\mathbf{k}_d$ is defined as [10]:

$$\mathbf{k}_d = -\frac{2\pi f}{c}\,[\sin(\theta_d)\cos(\phi_d),\ \sin(\theta_d)\sin(\phi_d),\ \cos(\theta_d)]^{T}, \tag{4}$$

with speed of sound $c$ and operator $(\cdot)^{T}$ denoting the transpose
of a vector or matrix. $\phi$ and $\theta$ denote azimuth and elevation
angle, respectively, and are defined as in [10], with
$(\phi, \theta) = (90°, 90°)$ denoting broadside. Furthermore, the $n$-th
microphone position in Cartesian coordinates is captured
by the three-dimensional vector $\mathbf{p}_n$, $n \in \{0, \ldots, N-1\}$.


The beamformer output $Y_{\mathrm{BF}}(l,f)$ is obtained by multiplying
each microphone signal with a complex-valued filter
weight $W_n(l,f)$, followed by a summation over all microphone
channels:

$$Y_{\mathrm{BF}}(l,f) = \mathbf{w}^{H}(l,f)\,\mathbf{x}(l,f), \tag{5}$$

where

$$\mathbf{w}(l,f) = [W_0(l,f),\, \ldots,\, W_{N-1}(l,f)]^{T} \tag{6}$$

contains the beamformer filter coefficients $W_n(l,f)$. Subsequently,
the postfilter is applied to the beamformer output signal,
yielding the overall output signal

$$Y(l,f) = G(l,f)\,Y_{\mathrm{BF}}(l,f), \tag{7}$$

where $G(l,f)$ describes the postfilter gains. After front-end
signal enhancement, $Y(l,f)$ is fed into the CHiME-3 baseline
acoustic back-end system [9].
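The signal flow of this section can be illustrated with a minimal NumPy sketch. The array shapes, the function name `apply_front_end`, and the use of time-invariant weights are assumptions made for illustration only; the weights are applied in conjugated form, which is the convention commonly used together with MVDR designs.

```python
import numpy as np

def apply_front_end(X, w, G):
    """Filter-and-sum beamforming followed by a bin-wise postfilter gain.

    X: (N, L, F) complex STFTs of N microphones, L frames, F frequency bins
    w: (N, F) complex beamformer weights (assumed time-invariant here)
    G: (L, F) real-valued postfilter gains
    """
    # Beamformer output: sum over channels using conjugated weights
    Y_bf = np.einsum("nf,nlf->lf", w.conj(), X)
    # Postfilter: real gain applied to every time-frequency bin
    return G * Y_bf, Y_bf

rng = np.random.default_rng(0)
N, L, F = 5, 4, 8
X = rng.standard_normal((N, L, F)) + 1j * rng.standard_normal((N, L, F))
w = np.full((N, F), 1.0 / N, dtype=complex)  # plain channel averaging as a toy beamformer
G = np.full((L, F), 0.5)                     # constant toy gain
Y, Y_bf = apply_front_end(X, w, G)
```

With the toy averaging weights, `Y_bf` is simply the mean over channels, which makes the sketch easy to check by hand.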

2.2. Minimum variance distortionless response beamformer

The filter weights of the MVDR beamformer are determined
such that the power of the noise components at the output
of the beamformer is minimized, subject to a distortionless
constraint in the target look direction. Thus, the constrained
optimization problem of the MVDR beamformer is given as [10]

$$\mathbf{w}_{\mathrm{MVDR}}(l,f) = \underset{\mathbf{w}(l,f)}{\operatorname{argmin}}\ \mathbf{w}^{H}(l,f)\,\mathbf{S}_{nn}(l,f)\,\mathbf{w}(l,f) \tag{8}$$

subject to

$$\mathbf{w}^{H}(l,f)\,\mathbf{d}(f) = 1, \tag{9}$$

where $\mathbf{S}_{nn}(l,f)$ is the multichannel spatio-spectral covariance
matrix of the noise components at the input of the beamformer,
and vector $\mathbf{d}(f)$ in (9) represents the steering vector
corresponding to the beamformer’s desired look direction
$(\phi_d, \theta_d)$, defined as

d(/) = ^ ^ (jQ) 

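Eq. (8) represents the minimization of the noise variance at
the output of the beamformer, whereas (9) contains the distortionless
constraint, which ensures that a plane wave coming
from the desired look direction $(\phi_d, \theta_d)$ can pass the system
without distortion. The optimum solution to the constrained
optimization problem in (8), (9) is given as [10]

$$\mathbf{w}_{\mathrm{MVDR}}(l,f) = \frac{\mathbf{S}_{nn}^{-1}(l,f)\,\mathbf{d}(f)}{\mathbf{d}^{H}(f)\,\mathbf{S}_{nn}^{-1}(l,f)\,\mathbf{d}(f)}. \tag{11}$$

$$\widehat{\mathrm{CDR}} = \frac{\Gamma_n\,\mathrm{Re}\{\hat{\Gamma}_x\} - |\hat{\Gamma}_x|^2 - \sqrt{\Gamma_n^2\,\mathrm{Re}\{\hat{\Gamma}_x\}^2 - \Gamma_n^2\,|\hat{\Gamma}_x|^2 + \Gamma_n^2 - 2\,\Gamma_n\,\mathrm{Re}\{\hat{\Gamma}_x\} + |\hat{\Gamma}_x|^2}}{|\hat{\Gamma}_x|^2 - 1} \tag{16}$$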


The multichannel spatio-spectral noise covariance matrix
$\mathbf{S}_{nn}(l,f)$ was estimated from a time interval of duration between
400 ms and 800 ms immediately before each utterance
[9]. As in the CHiME-3 baseline, all failing microphones are
excluded from the beamforming.

The DoA was determined by using the CHiME-3 baseline
localization approach, which uses a nonlinear SRP-PHAT
pseudo-spectrum [9].
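The closed-form MVDR solution can be sketched for a single frequency bin as follows. This is a minimal illustration, not the baseline implementation: the function name `mvdr_weights`, the randomly generated covariance matrix, and the unit-modulus steering vector are all hypothetical stand-ins.

```python
import numpy as np

def mvdr_weights(S_nn, d):
    """MVDR weights for one frequency bin.

    S_nn: (N, N) Hermitian positive-definite noise covariance matrix
    d:    (N,) steering vector of the look direction
    """
    # Solve S_nn a = d rather than forming the explicit inverse
    a = np.linalg.solve(S_nn, d)
    # Normalize so that the distortionless constraint w^H d = 1 holds
    return a / (d.conj() @ a)

rng = np.random.default_rng(1)
N = 5
A = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
S_nn = A @ A.conj().T + 1e-3 * np.eye(N)        # hypothetical noise covariance
d = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, N))  # hypothetical steering vector
w = mvdr_weights(S_nn, d)

constraint = w.conj() @ d                        # distortionless response
noise_power_mvdr = np.real(w.conj() @ S_nn @ w)
w_das = d / N                                    # delay-and-sum reference, also distortionless
noise_power_das = np.real(w_das.conj() @ S_nn @ w_das)
```

Both weight vectors satisfy the distortionless constraint, so comparing `noise_power_mvdr` with `noise_power_das` shows the benefit of taking the noise covariance into account.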


2.3. Coherence-based postfilter 

As illustrated in Fig. 1, we apply a postfilter to remove diffuse
noise components from the output of the MVDR beamformer.
The postfilter gain $G(l,f)$ at frame $l$ and frequency $f$ is given
as [11]:

$$G(l,f) = \max\left\{1 - \frac{\mu}{1 + \mathrm{SNR}(l,f)},\ G_{\min}\right\}, \tag{12}$$

with overestimation factor $\mu$ and gain floor $G_{\min}$. The postfilter
in (12) is a Wiener filter using the short-time SNR to
compute the filter gains $G(l,f)$. In this work, we approximate
the short-time SNR in (12) by the estimated Coherent-to-Diffuse
Power Ratio (CDR), which is the ratio between direct
and diffuse signal components. From (12) it can be seen
that a low CDR value, which corresponds to strong diffuse
signal components being present at the input of the system,
leads to low filter gains and vice versa.
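The behaviour described above can be sketched with a Wiener-type gain rule that includes overestimation and a gain floor. The specific form $G = \max\{1 - \mu/(1 + \mathrm{SNR}),\, G_{\min}\}$ is assumed here as a common way of writing such a rule; the default values $\mu = 1.3$ and $G_{\min} = 0.1$ are those reported in Section 4.1, and the function name `wiener_gain` is hypothetical.

```python
import numpy as np

def wiener_gain(snr, mu=1.3, g_min=0.1):
    """Wiener-type gain with overestimation factor mu and gain floor g_min.

    snr: short-time SNR on a linear (not dB) scale, here approximated by the CDR
    """
    return np.maximum(1.0 - mu / (1.0 + snr), g_min)

cdr = np.array([0.0, 1.0, 10.0, 100.0])  # linear-scale CDR values
gains = wiener_gain(cdr)
```

For a CDR of 0 (purely diffuse) the gain is clipped to the floor of 0.1, whereas for large CDR values it approaches 1, matching the qualitative statement in the text.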

The CDR between two omnidirectional microphones is
defined as [12]:


$$\mathrm{CDR}(l,f) = \frac{\Gamma_n(l,f) - \Gamma_x(l,f)}{\Gamma_x(l,f) - \Gamma_s(l,f)}, \tag{13}$$


where $\Gamma_x(l,f)$ is the spatial coherence function of both microphone
signals. Moreover, the spatial coherence functions
for the direct and diffuse sound components are given as

$$\Gamma_s(l,f) = e^{j 2\pi f \Delta t}, \tag{14}$$

$$\Gamma_n(l,f) = \Gamma_{\mathrm{diff}}(f) = \operatorname{sinc}\!\left(2\pi f\,\frac{d}{c}\right), \tag{15}$$

respectively, with Time Difference of Arrival (TDOA) $\Delta t$ and
microphone spacing $d$.
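The two coherence models can be evaluated numerically as sketched below. The speed of sound of 343 m/s, the spacing of 0.1 m, the TDOA value, and the sign convention of the direct-path phase term are assumptions for illustration. Note that `numpy.sinc` is the normalized sinc, $\sin(\pi x)/(\pi x)$, so the argument has to be rescaled to obtain $\sin(a)/a$ with $a = 2\pi f d / c$.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s

def diffuse_coherence(f, d, c=C):
    """Coherence of an ideal diffuse field between two omnidirectional microphones."""
    # sin(a)/a with a = 2*pi*f*d/c equals np.sinc(2*f*d/c)
    return np.sinc(2.0 * f * d / c)

def direct_coherence(f, tdoa):
    """Unit-magnitude coherence of a single plane wave (sign convention assumed)."""
    return np.exp(1j * 2.0 * np.pi * f * tdoa)

f = np.linspace(0.0, 8000.0, 513)         # frequency axis for 16 kHz sampling
gamma_n = diffuse_coherence(f, d=0.1)      # hypothetical spacing of 10 cm
gamma_s = direct_coherence(f, tdoa=1e-4)   # hypothetical TDOA of 0.1 ms
first_zero = diffuse_coherence(C / (2 * 0.1), d=0.1)  # first zero at f = c / (2 d)
```

The diffuse coherence starts at 1 for low frequencies and decays with frequency and spacing, which is why closely spaced microphones cannot distinguish direct from diffuse sound at low frequencies.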


Many different CDR estimators have been proposed in the
literature, see, e.g., [13, 14, 15]. The CDR estimator we use
in this work was proposed in [16] and is given by (16), where
$\mathrm{Re}\{\cdot\}$ and $|\cdot|$ represent the real part and magnitude of $(\cdot)$,
respectively. Moreover, $\hat{\Gamma}_x(l,f)$ and $\widehat{\mathrm{CDR}}(l,f)$ are the estimated
coherence and CDR of the two microphone signals,
respectively. Note that $l$ and $f$ have been omitted in (16)
for brevity. As can be seen from (16), the employed estimator
does not require the DoA of the speech source, since
$\Gamma_s(l,f)$ is not required for calculating $\widehat{\mathrm{CDR}}(l,f)$. In [16] it
was shown that the employed estimator (16) is unbiased and
robust in the sense that deviations of the coherence estimate
$\hat{\Gamma}_x(l,f)$ from the assumed model do not lead to large deviations
of the CDR estimate. A more detailed investigation of
the employed CDR estimator (16) and a comparison to different
estimators with respect to bias, robustness, and dereverberation
performance can be found in [16, 17].
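The DoA-independent estimator (16) can be sketched in a few lines of NumPy. The function name `estimate_cdr_doa_independent` and the clipping of negative values are additions for illustration, not part of the estimator as described. The usage example builds a model-consistent coherence from an assumed mixing model, $\Gamma_x = (\mathrm{CDR}\,\Gamma_s + \Gamma_n)/(\mathrm{CDR} + 1)$, and checks that the CDR is recovered without knowledge of $\Gamma_s$.

```python
import numpy as np

def estimate_cdr_doa_independent(gamma_x, gamma_n):
    """CDR estimate from the complex coherence gamma_x and the real diffuse coherence gamma_n."""
    re = np.real(gamma_x)
    mag2 = np.abs(gamma_x) ** 2
    radicand = (gamma_n**2 * re**2 - gamma_n**2 * mag2 + gamma_n**2
                - 2.0 * gamma_n * re + mag2)
    # Guard against tiny negative radicands caused by estimation errors
    root = np.sqrt(np.maximum(radicand, 0.0))
    cdr = (gamma_n * re - mag2 - root) / (mag2 - 1.0)
    # Clip negative estimates (an added safeguard, since a power ratio is non-negative)
    return np.maximum(cdr, 0.0)

# Model-consistent test signal: mixture of direct and diffuse coherence
cdr_true = np.array([0.1, 1.0, 3.0, 10.0])
gamma_s = np.exp(1j * 0.7)   # hypothetical direct-path coherence, unknown to the estimator
gamma_n = 0.5                # hypothetical diffuse coherence at this frequency
gamma_x = (cdr_true * gamma_s + gamma_n) / (cdr_true + 1.0)
cdr_est = estimate_cdr_doa_independent(gamma_x, gamma_n)
```

In practice `gamma_x` would be replaced by a short-time estimate, in which case the denominator can approach zero for almost fully coherent bins and may need additional regularization.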

When applying the coherence-based postfilter to the output
of a beamformer, two aspects need to be considered: First,
since the microphone array of the CHiME-3 challenge consists
of five forward-facing microphones, the CDR estimator
(initially designed for a pair of microphones) has to be
adapted to exploit all available microphone signals. To do so,
we apply the CDR estimator (16) to every pair of non-failing
microphones, i.e., ten pairs for five microphones, to obtain the
CDR estimate of each microphone pair. From each of these
estimates, we calculate the respective diffuseness values as

Gsini: 


$$D(l,f) = \frac{1}{1 + \widehat{\mathrm{CDR}}(l,f)}. \tag{17}$$

Subsequently, we take the arithmetic average of all microphone
pair-specific diffuseness values, and calculate the final
CDR estimate as

$$\widehat{\mathrm{CDR}}_{\mathrm{in}}(l,f) = \frac{1 - \bar{D}(l,f)}{\bar{D}(l,f)}, \tag{18}$$
where $\widehat{\mathrm{CDR}}_{\mathrm{in}}(l,f)$ describes the final CDR estimate at the
input of the system, and $\bar{D}(l,f)$ denotes the average diffuseness
obtained by calculating the mean of all microphone pair-specific
diffuseness values. Second, note that the obtained
CDR estimate $\widehat{\mathrm{CDR}}_{\mathrm{in}}(l,f)$ is an estimate of the CDR at the
input of the signal enhancement system, i.e., the beamformer.
However, what we actually need is the CDR at the output of
the beamformer. This can be obtained by applying a correction
factor $A_\Gamma(l,f)$ to $\widehat{\mathrm{CDR}}_{\mathrm{in}}(l,f)$. Thus, the CDR estimate
at the output of the beamformer $\widehat{\mathrm{CDR}}_{\mathrm{BF}}(l,f)$ is defined as

$$\widehat{\mathrm{CDR}}_{\mathrm{BF}}(l,f) = \frac{\widehat{\mathrm{CDR}}_{\mathrm{in}}(l,f)}{A_\Gamma(l,f)}, \tag{19}$$

where $A_\Gamma(l,f)$ is given as [18]

$$A_\Gamma(l,f) = \mathbf{w}^{H}(l,f)\,\mathbf{J}_{\mathrm{diff}}(f)\,\mathbf{w}(l,f), \tag{20}$$

where $\mathbf{J}_{\mathrm{diff}}(f)$ is the spatial coherence matrix of a diffuse
noise field.

Fig. 2 shows the block diagram of the employed front-end
enhancement system, consisting of beamformer and
coherence-based postfilter.

Fig. 2. Illustration of the front-end signal processing consisting
of beamforming and coherence-based postfilter which is
applied to the beamformer output.
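The two adaptations described in this section, averaging over microphone pairs in the diffuseness domain and correcting the CDR for the beamformer, can be sketched as follows. All function names, the array shapes, and the sinc-based construction of the diffuse coherence matrix from microphone positions are assumptions for illustration; the mapping from average diffuseness back to a CDR is taken to be the inverse of the diffuseness definition.

```python
import itertools
import numpy as np

def average_cdr_over_pairs(cdr_pairs):
    """cdr_pairs: (P, L, F) CDR estimates of P microphone pairs."""
    diffuseness = 1.0 / (1.0 + cdr_pairs)   # pair-specific diffuseness
    d_mean = diffuseness.mean(axis=0)       # arithmetic average over pairs
    return (1.0 - d_mean) / d_mean          # inverse mapping back to a CDR

def diffuse_coherence_matrix(positions, f, c=343.0):
    """positions: (N, 3) in metres; entry (i, j) is sin(a)/a with a = 2*pi*f*d_ij/c."""
    dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    return np.sinc(2.0 * f * dist / c)

def cdr_at_beamformer_output(cdr_in, w, J_diff):
    """Divide the input CDR by the diffuse-noise gain w^H J_diff w of the beamformer."""
    a_gamma = np.real(w.conj() @ J_diff @ w)
    return cdr_in / a_gamma

# Ten unordered pairs for five microphones
pairs = list(itertools.combinations(range(5), 2))

# Toy check: identical pair-wise CDRs are left unchanged by the averaging
cdr_pairs = np.full((len(pairs), 3, 4), 2.0)
cdr_in = average_cdr_over_pairs(cdr_pairs)

# Toy check: for spatially uncorrelated noise (J = I), averaging weights reduce the diffuse part by 1/N
N = 5
w = np.full(N, 1.0 / N, dtype=complex)
cdr_bf = cdr_at_beamformer_output(cdr_in, w, np.eye(N))
```

Averaging in the diffuseness domain keeps the averaged quantity bounded between 0 and 1, so a single pair with a very large CDR estimate cannot dominate the result.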

3. BACK-END ACOUSTIC MODELING 

As indicated in Fig. 1, we employ the acoustic back-end system
provided by the CHiME-3 baseline ASR system. It provides
an HMM-GMM system, consisting of 2500 tied triphone
HMM states which are modeled by 15000 Gaussians.
The HMM-GMM system is designed to provide WERs at relatively
low computational costs. In addition, an HMM-DNN
ASR system providing state-of-the-art ASR performance is
contained in the CHiME-3 baseline. It employs a seven-layer
DNN with 2048 neurons per hidden layer and is based on
the Kaldi toolkit [19]. The DNN training process includes
pre-training using restricted Boltzmann machines, cross-entropy
training, and sequence-discriminative training using the
state-level minimum Bayes risk (sMBR) criterion. For a more
detailed presentation of the baseline ASR systems, see [9].

4. EXPERIMENTAL RESULTS 

In the following, we investigate the impact of our proposed
front-end enhancement on the STFT spectra of a noisy speech
utterance, and evaluate the speech recognition accuracy of the
front-end with respect to WERs using the CHiME-3 baseline
ASR systems.

4.1. Setup and parameters 

For all experiments, we use half-overlapping sine windows of
1024 samples to obtain the complex-valued STFT representation
of the signals, which is equal to the baseline processing
presented in [9]. The signals were processed at a sampling
rate of 16 kHz. The DoA of the desired source, which is required
for the MVDR beamformer design, was obtained using
the baseline localization algorithm [9]. For realizing the
coherence-based postfilter, we chose gain floor $G_{\min} = 0.1$
and overestimation factor $\mu = 1.3$. The short-time coherence
estimates $\hat{\Gamma}_x(l,f)$ were obtained by recursive averaging
of the auto- and cross-power spectra with forgetting factor
A = 0.68, as in MM. 
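The recursive averaging of auto- and cross-power spectra can be sketched as a first-order recursion per frequency bin. This is a minimal illustration under assumptions: the function name `recursive_coherence`, the zero initialization, and the small regularization constant are not taken from the text; only the forgetting factor of 0.68 is.

```python
import numpy as np

def recursive_coherence(X1, X2, lam=0.68, eps=1e-12):
    """Short-time coherence of two microphone STFTs X1, X2 of shape (L, F)."""
    L, F = X1.shape
    phi_11 = np.zeros(F)                 # auto-power spectrum of microphone 1
    phi_22 = np.zeros(F)                 # auto-power spectrum of microphone 2
    phi_12 = np.zeros(F, dtype=complex)  # cross-power spectrum
    gamma = np.zeros((L, F), dtype=complex)
    for l in range(L):
        # First-order recursive smoothing with forgetting factor lam
        phi_11 = lam * phi_11 + (1.0 - lam) * np.abs(X1[l]) ** 2
        phi_22 = lam * phi_22 + (1.0 - lam) * np.abs(X2[l]) ** 2
        phi_12 = lam * phi_12 + (1.0 - lam) * X1[l] * np.conj(X2[l])
        gamma[l] = phi_12 / (np.sqrt(phi_11 * phi_22) + eps)
    return gamma

rng = np.random.default_rng(2)
L, F = 20, 16
X1 = rng.standard_normal((L, F)) + 1j * rng.standard_normal((L, F))
X2 = rng.standard_normal((L, F)) + 1j * rng.standard_normal((L, F))
gamma_same = recursive_coherence(X1, X1)    # identical signals: fully coherent
gamma_indep = recursive_coherence(X1, X2)   # independent noise: magnitude below 1 on average
```

A forgetting factor of 0.68 corresponds to a short averaging memory of a few frames, which lets the coherence estimate follow the onsets and offsets of speech.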

The ASR task included sets of real and simulated noisy
utterances in four different environments: café (CAF), street
junction (STR), public transport (BUS), and pedestrian area
(PED). For each environment, a training set, a development
set, and an evaluation set consisting of real and simulated data
were provided [9].

4.2. Illustration of front-end impact in the STFT domain

In Fig. 3, we illustrate the impact of the MVDR beamformer
and the coherence-based postfilter on the STFT spectra of a
noisy utterance, with the number of frames $l$ and frequency $f$
on the horizontal and vertical axis, respectively. Note that
the coarse temporal resolution of the STFT spectra is due
to the baseline block-processing. As a reference, the spectrum
of the close-talking microphone (channel 0) is shown in
Fig. 3(a). It contains the desired utterance plus little background
noise. The recorded desired signal is a male speaker
saying “Our guess is no” in the café environment. The spectrum
of microphone 1 is illustrated in Fig. 3(b). As can be
seen, low- as well as high-frequency noise is acquired by the
microphone, whereas most of the noise is present in the frequency
range of speech. Applying the baseline MVDR beamformer
leads to a reduction of the interfering components,
as illustrated in Fig. 3(c). A comparison of Fig. 3(c) with
Fig. 3(d) shows that applying the coherence-based postfilter
to the MVDR beamformer output yields a significant reduction
of interference across the entire frequency range, but it
also removes low-frequency components of the desired signal.
The estimated diffuseness $D_{\mathrm{BF}}(l,f)$ at the beamformer
output is illustrated in Fig. 3(e). Comparing Figs. 3(e) and
3(c) shows that $D_{\mathrm{BF}}(l,f)$ is very low whenever the desired
source is active, which is to be expected, since the CDR will
be high whenever the desired source is active. A final comparison
of Figs. 3(a) and 3(d) reveals the similarity between
the front-end output signal $Y(l,f)$ and the close-talking microphone
signal $S(l,f)$, which indicates the effectiveness of
the proposed front-end signal enhancement technique.

4.3. Evaluation of estimation accuracy 

Table 1 summarizes the average WERs (in %) of the baseline
(MVDR) and the extended (MVDR+PF) front-end enhancement
obtained for the CHiME-3 baseline HMM-GMM and
HMM-DNN ASR (termed HMM-DNN+sMBR in the tables
to be consistent with [9]) systems. The WERs were averaged
over all four acoustic environments. In the first column,
the employed acoustic model is specified. The test and training
data sets are indicated in the second and third column,
whereas the respective results for the development and evaluation
data set are given in the fourth and fifth column. The
ASR systems have always been trained on the output signals
of the applied front-end enhancement. As a reference, the
first row in Table 1 contains the WERs obtained for the noisy
unprocessed microphone signals. Note that the results in the
case of no front-end enhancement (Noisy) and for the baseline
MVDR beamformer (second row in Table 1) only differ
slightly from the results presented in [9]. The slight deviations
are due to random initialisation and machine-specific
issues.

Table 1. Average WERs (in %) obtained with the baseline (MVDR) and extended (MVDR+PF) front-end signal enhancement
for the baseline HMM-GMM and HMM-DNN ASR systems.

| Acoustic model | Test data | Training data | Development set, real data | Development set, sim. data | Evaluation set, real data | Evaluation set, sim. data |
|----------------|-----------|---------------|----------------------------|----------------------------|---------------------------|---------------------------|
| HMM-GMM        | Noisy     | Noisy         | 18.67                      | 18.07                      | 32.97                     | 21.89                     |
| HMM-DNN+sMBR   | Noisy     | Noisy         | 16.70                      | 14.38                      | 34.53                     | 21.34                     |
| HMM-GMM        | MVDR      | MVDR          | 20.87                      | 9.67                       | 38.18                     | 10.99                     |
| HMM-DNN+sMBR   | MVDR      | MVDR          | 17.70                      | 8.22                       | 33.88                     | 10.79                     |
| HMM-GMM        | MVDR+PF   | MVDR+PF       | 16.13                      | 11.55                      | 28.29                     | 12.87                     |
| HMM-DNN+sMBR   | MVDR+PF   | MVDR+PF       | 14.97                      | 10.17                      | 28.68                     | 15.24                     |

Table 2. WERs (in %) obtained with the extended front-end signal enhancement for the baseline HMM-DNN ASR system in
each scenario.

| Environment | Development set, real data | Development set, sim. data | Evaluation set, real data | Evaluation set, sim. data |
|-------------|----------------------------|----------------------------|---------------------------|---------------------------|
| BUS         | 17.63                      | 8.94                       | 35.58                     | 11.52                     |
| CAF         | 14.65                      | 12.23                      | 32.69                     | 17.37                     |
| PED         | 12.97                      | 8.42                       | 26.61                     | 15.48                     |
| STR         | 14.64                      | 11.11                      | 19.85                     | 16.57                     |

When comparing the results of the HMM-GMM ASR
system in the first and second row, one can observe that the
baseline front-end enhancement only improves the WERs
for simulated data. In the case of real data, the recognition
accuracy of the baseline front-end processing is significantly
worse than without front-end signal processing. For the
HMM-DNN-based recognizer, significant WER improvements
can be observed for simulated data, whereas for real
data there is no clear advantage of the baseline front-end
processing compared to no front-end processing.

A comparison of the results for the HMM-GMM ASR
system in the second and third row shows that applying the
coherence-based postfilter to the MVDR beamformer output
signal drastically decreases the average WER for real data,
with an improvement of 4.74 and 9.89 percentage points for
the development and evaluation data set, respectively. It can
also be seen that the WERs of the extended front-end are
slightly increased for simulated data. The reason for this may
be that the employed postfilter parameters $\mu$ and $G_{\min}$ are
suboptimal for the simulated data set. The results for the baseline
(MVDR) and the proposed front-end (MVDR+PF) obtained
with the HMM-DNN ASR system in the second and third
row show the same tendencies. Our proposed front-end enhancement
yields significantly lower WERs for real data and
a worse recognition accuracy for simulated data. In the case
of real data, the WERs were decreased by 2.73 and 5.2 percentage
points for the development and evaluation data set,
respectively, by applying the coherence-based postfilter.
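The quoted improvements follow directly from the real-data columns of Table 1. The short sketch below recomputes them as absolute differences in percentage points; the dictionary layout and the helper name `absolute_reduction` are illustrative choices only.

```python
# Average WERs (in %) on real data from Table 1: (development set, evaluation set)
wer_real = {
    "HMM-GMM": {"MVDR": (20.87, 38.18), "MVDR+PF": (16.13, 28.29)},
    "HMM-DNN+sMBR": {"MVDR": (17.70, 33.88), "MVDR+PF": (14.97, 28.68)},
}

def absolute_reduction(model):
    """WER reduction in percentage points when adding the postfilter to the MVDR output."""
    base = wer_real[model]["MVDR"]
    ext = wer_real[model]["MVDR+PF"]
    return tuple(round(b - e, 2) for b, e in zip(base, ext))

gmm = absolute_reduction("HMM-GMM")        # (4.74, 9.89)
dnn = absolute_reduction("HMM-DNN+sMBR")   # (2.73, 5.2)
```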

It is interesting to note that for our proposed front-end, the
HMM-DNN ASR system only yields a better recognition performance
than the HMM-GMM system for the development
data, whereas for the real evaluation data the HMM-GMM
ASR system achieves lower WERs. Especially for the simulated
evaluation data, the HMM-GMM ASR is superior to
the HMM-DNN-based recognizer. One explanation for this
phenomenon might be a suboptimal architecture of the DNN,
which we did not optimize as part of this contribution. Finally,
we can observe that only when the postfilter is applied to the
MVDR output signal are significantly lower WERs obtained with
both baseline ASR systems for real data, compared to the unprocessed
signal, which confirms the effectiveness of our proposed
postfilter.

In Table 2, the scenario-specific WERs of our proposed
front-end enhancement obtained with the baseline HMM-DNN
ASR system are provided. Judging from the obtained
WERs, the BUS environment seems to be the most challenging
scenario for real data, whereas the highest WER for
simulated data was obtained for the café scenario.
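As a consistency check between the two tables, the per-environment WERs of Table 2 can be averaged and compared with the MVDR+PF row of the HMM-DNN system in Table 1. The sketch below does this and also identifies the hardest environment per column; variable names are illustrative, and the tolerance accounts for the two-decimal rounding of the published values.

```python
# Table 2: WERs (in %) per environment: (dev real, dev sim, eval real, eval sim)
table2 = {
    "BUS": (17.63, 8.94, 35.58, 11.52),
    "CAF": (14.65, 12.23, 32.69, 17.37),
    "PED": (12.97, 8.42, 26.61, 15.48),
    "STR": (14.64, 11.11, 19.85, 16.57),
}
# Table 1: HMM-DNN+sMBR with MVDR+PF, same column order
table1_row = (14.97, 10.17, 28.68, 15.24)

# Column-wise mean over the four environments
means = [sum(col) / len(col) for col in zip(*table2.values())]
max_dev = max(abs(m - t) for m, t in zip(means, table1_row))

# Environment with the highest WER in each column
hardest = [max(table2, key=lambda env: table2[env][i]) for i in range(4)]
```

The column means agree with Table 1 to within rounding, and `hardest` reproduces the observation that BUS is the most difficult environment for real data and the café for simulated data.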

5. CONCLUSION 

In this contribution to the CHiME-3 challenge, we proposed
an extension of the baseline front-end speech enhancement
by a coherence-based postfilter. The postfilter is realized as
a Wiener filter, where an estimate of the ratio between direct
and diffuse signal components at the output of the baseline
MVDR beamformer is used as an approximation of the short-time
SNR to compute the filter gains. To estimate the ratio between
direct and diffuse signal components, we used a DoA-independent
estimator, which can be efficiently realized since
it only requires an estimate of the auto- and cross-power spectra
of the microphone signals. As a consequence, the proposed
postfilter has a very low computational complexity as well.
Both the baseline and the extended front-end speech enhancement
have been evaluated on real and simulated data with respect
to WERs using the baseline HMM-GMM and HMM-DNN
ASR systems. The results confirmed that the proposed
coherence-based postfilter significantly improves the recognition
accuracy of the enhanced speech compared to the MVDR
beamformer when applied to real data. The improved recognition
accuracy in addition to the low computational complexity
makes the proposed postfilter very suitable for real-time
robust distant speech recognition. Future work includes the
analysis of the performance of DoA-dependent CDR estimators
for the CHiME-3 data. Also, combining DoA-dependent
and DoA-independent CDR estimators in different frequency
ranges will be investigated. Moreover, using spatial diffuseness
features as an additional input to a DNN-based acoustic
model, as proposed in [20], is another avenue for future work.
Fig. 3. Illustration of the impact of front-end signal processing
on the recorded noisy microphone signal, with recorded
close-talking desired signal $S(l,f)$ in (a), microphone signal
$X_1(l,f)$ in (b), baseline beamformer output signal $Y_{\mathrm{BF}}(l,f)$
in (c), and postfilter output signal $Y(l,f)$ in (d). Fig. (e)
shows the diffuseness $D_{\mathrm{BF}}(l,f)$ which was estimated from
the beamformer output signal in (c), and which has been used
to compute the postfilter gains.